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Abstract. Structural parameters are normally extracted from observed galaxies by 
fitting analytic light profiles to the observations. Obtaining accurate fits to high-resolution 
images is a computationally expensive task, requiring many model evaluations and con- 
volutions with the imaging point spread function. While these algorithms contain high 
degrees of parallelism, current implementations do not exploit this property. With ever- 
. growing volumes of observational data, an inability to make use of advances in com- 

Qh' puting power can act as a constraint on scientific outcomes. This is the motivation 

behind our work, which aims to implement the model-fitting procedure on a graphics 
^ . processing unit (GPU). We begin by analysing the algorithms involved in model evalua- 

' tion with respect to their suitability for modern many-core computing architectures like 

I GPUs, finding them to be well-placed to take advantage of the high memory bandwidth 

offered by this hardware. Following our analysis, we briefly describe a preliminary im- 
plementation of the model fitting procedure using freely-available GPU libraries. Early 
, results suggest a speed-up of around lOx over a CPU implementation. We discuss the 

I . opportunities such a speed-up could provide, including the ability to use more com- 

1^ ' putationally expensive but better-performing fitting routines to increase the quality and 

] robustness of fits. 
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1. Introduction 

Recent trends in commodity computing hardware have seen a dramatic shift first from 
rS I single-core processors to multi-core and then to accelerated platforms like graphics 

■ processing units (GPUs). GPUs were originally designed to speed up 3D graphics 

calculations for video games, but their immense memory bandwidth and arithmetic 
capabilities have seen them re-purposed for the needs of scientific computing. While 
unquestionably powerful, their radically different, massively -parallel architectures have 
shaken up the software community. Astronomy is one of many fields trying to adapt to 
these changes in computing hardware. 

While the area is still in its infancy, GPUs have already been shown to provide sig- 
nificant speed-ups across a range of astronomy problems. These include direct N-body 
simulation (e.g.. | Hamada et al.ll2009i) . adaptive mesh refinement hydrodyna mics (e.g., 
Schive et al. galaxy spectra l energy density calculations (Jonsson & Primackl 



20101). gravitational mi crolensing dBate et aLll2010h . corr elation for radio telescope s 



(e.g.. lWayth et al.ir2009t) and coherent pulsar dedispersion dvan Straten & Bailesll201Cll) . 



The approach taken in each of these cases has, however, been ad hoc in nature - the 
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transition to the GPU has been guided largely by hardware-specific documentation, 
code samples and simple trial and error. While such an approach has proven very suc- 
cessful for these early adopters, it is not clear that it will remain effective when it comes 
to more complex algorithms. Furthermore, in some cases the cost of re-implementing a 
code may be too large to gamble on a return (i.e., a speed-up) of unknown magnitude. 

In this paper we discuss the potential for accelerating the process of galaxy fitting 
using GPUs. Rather than tackling the challenge b lind, we instead use a generalised 
method based on algorithm analysis as outlined in iBarsdell et al.l (l2010 h. The galaxy 
fitting process is described in Section[2j which is followed by a full analysis of the prob- 
lem in Section [3] A preliminary implementation and results are described in Section |4] 
before our summary discussion in Section |5] 



2. Galaxy Fitting 

A common problem in astronomy is to fit analytic surface brightness profiles to obser- 
vations of globular clusters or galaxies in order to extract structural parameters such 
as the effective radius, ellipticity or integrated flux magnitude. The fitting procedure is 
typically non-linear and of high dimensionality, demanding the use of powerful optimi- 
sation routines. Many codes have been developed to perform this task, includi ng, e.g., 
isHAPE (Larsen 1999), galfit (Peng et al. 2002) and galphat dYoon et al.ll2010h . which 



use the downhill simplex, Levenberg Marquardt and Markov-Chain Monte-Carlo meth- 
ods respectively. While there is a variety of optimisation techniques in common use, 
most follow a similar pattern: 

1. Evaluate a model on a grid using the current set of parameters (guessed initially). 

2. Convolve the model with the point spread function (PSF) of the observation. 

3. Compare the model and observation. 

4. Adjust the model parameters. 

5. Check finishing criteria and return optimised parameters if complete, otherwise 
repeat from Step 1. 

Step 4 is where the specifics of a particular fitting routine come into play, while steps 
1-3 generally remain unchanged between methods. Given the pixel-counts of modern 
astronomical observations, and the fact that most fitting routines require a very large 
number of iterations, the optimisation process can be highly computationally intensive. 
The quantity and quality of science results are thus tied to the available computing 
power and a code's ability to take advantage of it. 

Computationally-limited problems like galaxy fitting are ideal candidates for ac- 
celeration. The fact that steps 1-3 are common to a large number of fitting routines 
allows us to study the problem with significant generality. Additionally, the image- 
based nature of the operations immediately suggests suitability for GPUs. 



3. Algorithm Analysis 



In order to determine whether galaxy fitting is a suitable application for GPU accelera- 
tion, we use an approach based around algorithm analysis as described in IBarsdell et al. 
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(I2010h . We begin by identifying known algorithms within the steps in the problem out- 



line presented in Section |2l 

1. Evaluation of a model on a grid is an example of a transform algorithm. 

2. Convolution with the PSF is best done in Fourier space, requiring the fast Fourier 
transform (FFT) and regular transform algorithms. 

3. Comparison of a model with an observation typically involves computation of 
the "sum of squared differences", which involves the transform and reduce al- 
gorithms. 

Note that the algorithms required during optimisation of the model parameters in step 
4 will depend on the chosen fitting routine. 

It is thus seen that the fitting procedure makes use of only the transform, FFT and 
reduce algorith ms, all of wh ich are known to be very efficient on many-core architec- 



tures like GPUs (iBarsdell et a l. 2010). 



The next step in the analysis is to look at the global characteristics of the com- 
putation. Both the transform and reduce algorithms have a work complexity of 0{N), 
indicating that a constant number of operations is performed for every image pixel. 
The FFT algorithm has a work complexity of 0(N log A^), indicating that C?(log N) op- 
erations are performed for each of the N image pixels. The convolution step is thus 
expected to consume the majority of the processing time for large images. 

The fitting procedure for an image of pixels therefore requires reading 0{N) val- 
ues, repeatedly performing 0(N log N) arithmetic operations 0{Niter) times, and writing 
out 0(1) optimised parameter values (where Nuer is the number of iterations required 
to obtain a good fit). In the best case scenario then, the problem has a ratio of memory 
to compute operations, or arithmetic intensity, of 0(Nite,-logN). 

The ability to achieve this arithmetic intensity depends on the memory access pat- 
terns of the component algorithms. Because the FFT algorithm requires an all-to-all 
communication pattern (i.e., each output value depends on every input value), it is nec- 
essary to have the entire image globally accessible during each iteration of the fitting 
routine. This rules out storing the image data in very small caches (e.g., the shared 
memory on NVIDIA GPUs) between iterations. However, there is generally more than 
enough main memory on a GPU to hold an entire image. This means that the data may 
be left on the device for all Nuer iterations, with no need to go back to the host's memory 
or disk until the final results have been obtained. The limiting factor will instead be the 
internal memory bandwidth of the device. This is a good result, as current GPUs have 
significantly more memory bandwidth than CPUs, and one can expect a corresponding 
speed-up. 



4. Implementation Results 

Given the positive results of the algorithm analysis in Section [3j a prototype imple- 
mentation of the galaxy fitting problem was deemed worthwhile. NVIDIA's CUDAQ 
platform was used to interface to the GPU. FFTs were performed using the CUFFT 



'http://www.nvidia.com/object/cuda_home_new.html 
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librar>0, and the Thrush C++ library was used for its efficient implementations of the 

transform and reduce algorithms. 

Given the subtleties of mature codes Uke galfit ( Peng et aLll2002h . performing an 
accurate comparison with our prototype GPU code is not yet possible. Preliminary re- 
sults, however, suggest a speed-up in the main computations of around lOx when using 
a single NVIDIA Tesla €1060 GPU versus an Intel Nehalem CPU. Profihng results 
also indicate that the GPU hardware is being used efficiently by all of the algorithms in 
the code. These results support the conclusions of our analysis in the previous section. 



5. Discussion 

Many-core architectures like GPUs are now an important part of the computing land- 
scape. While many software challenges remain, a generalised approach to analysing 
astronomy problems has proven very useful in tackling new GPU codes. 

Galaxy fitting looks to be a promising application of GPU technology. Signifi- 
cant speed-ups present the opportunity to perform faster fits, which may be crucial for 
the next generation of galaxy surveys. Alternatively, the additional processing speed 
could be fed back into the fitting routine to provide fits of much better quality in the 
same length of time, helping to overcome common problems such as local minima and 
unphysical results. 

While useful as a profihng tool, our prototype GPU code requires significant fur- 
ther development before it can be considered a viable alternative to other galaxy fitting 
codes in use by the astronomy community. Future work will address such development. 

Given the generality of our analysis, it is likely that other fitting problems in as- 
tronomy would also benefit from GPU acceleration. If one allows flexibility in the 
dimensionality of the problem, procedures such as spectral line or cube fitting become 
possible. Such problems will also be the subject of future work. 
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